Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Abstract
The NLP researcher or application-builder often wonders: “What corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available; what is needed is a framework for evaluating corpora. We develop such a framework for corpora that aim for good coverage of ‘general language’. The task we set is the automatic creation of a publication-quality collocations dictionary. For a sample of 100 Czech headwords and 100 English headwords, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.
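As a rough illustration of the evaluation described in the abstract, the sketch below scores a list of corpus-derived collocation candidates for one headword against a gold standard set. The function name `precision_recall`, the tuple representation of a collocation, and the sample data are all assumptions made for this example; they are not taken from the paper or its datasets.

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of candidate collocations against a gold set.

    Both arguments are iterables of hashable collocation keys
    (here: hypothetical word-pair tuples).
    """
    candidates, gold = set(candidates), set(gold)
    hits = len(candidates & gold)  # candidates that are in the gold standard
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Invented example data for a single headword ("tea"); not from the paper.
gold = {("strong", "tea"), ("green", "tea"), ("tea", "bag"), ("cup", "tea")}
candidates = [
    ("strong", "tea"),
    ("green", "tea"),
    ("hot", "tea"),
    ("tea", "bag"),
    ("tea", "time"),
]

p, r = precision_recall(candidates, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

With this toy input, three of the five candidates are in the gold set and three of the four gold items are found. Comparing corpora or parameter settings would then amount to repeating this calculation over all sampled headwords for each configuration.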